Traditional financial data websites usually shy away from making recommendations or advising users on investing decisions, often because of the financial or legal liability that could arise if users act on those recommendations. While in-depth studies are conducted in academia and research, their outputs are often neither accessible nor timely enough for the regular user, which limits the use cases of these websites and their applications. As such, we see a gap between what is publicly available and what users require.
The motivation for building this submodule is to bridge that gap between user requirements and current market offerings. Its main objective is to give users a forecast of how stock prices are likely to move, so that they can identify potential entry and exit points.
Time series data is a sequential collection of numerical data points. In investing, a time series dataset tracks the movement of a chosen index over a period of time, with readings recorded at regular intervals (e.g. daily, monthly, yearly).
Time series analysis allows users to uncover meaningful information and trends in data collected over time. Time series forecasting, on the other hand, uses the information contained in historical values and their associated patterns to forecast future movements. Most often, we look at trends, cyclical fluctuations and seasonality.
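As a minimal sketch of these ideas (using simulated data rather than real stock prices, purely for illustration), base R's `ts()` and `decompose()` can separate a series into exactly the components discussed above: trend, seasonality and the remaining noise.

```r
# Simulated monthly series: upward trend + yearly seasonality + noise
set.seed(42)
trend  <- seq(100, 130, length.out = 36)        # 3 years of upward drift
season <- 5 * sin(2 * pi * (1:36) / 12)         # repeating yearly cycle
prices <- trend + season + rnorm(36, sd = 2)    # random noise on top

ts_demo <- ts(prices, start = c(2018, 1), frequency = 12)
plot(decompose(ts_demo))  # panels: observed, trend, seasonal, random
```

The decomposition plot makes visible the same structure that the forecasting models later in this report exploit.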
To demonstrate the time series analysis and forecasting capabilities of the proposed Shiny application, we use data from a single company, Apple (Ticker: AAPL), as the example throughout this report. There is no methodological reason for choosing Apple other than it being one of the most recognisable companies in the market. The final product will be scaled up considerably to let users choose from a much larger set of stocks on which to perform time series forecasts.
The literature review identified two key methods for forecasting future stock prices: fundamental analysis, which uses the information provided in a company's financial statements and annual reports, and technical analysis, which uses past trends in the stock market.
In this report, historical prices are the sole data used to predict movements in stock prices. While it resembles technical analysis in relying on past data, time series forecasting is not the same thing and can be seen as a natural extension of, or logical next step after, technical analysis. The main difference is that time series forecasting produces an exact forecasted price, whereas technical analysis only predicts the direction (up/down) of future movement (Berdiell, 2015).
The fundamental idea of this method is to seek out patterns in historical stock prices with a hybrid approach, which combines multiple models to forecast stock prices. For example, Markowska-Kaczmar and Dziedzic (2008) and Wang et al. (2015) both proposed tackling stock price forecasting with an amalgamation of multiple models rather than relying on a single form of forecasting, and showed that this yields superior forecasting accuracy and performance compared to any one dedicated method. Dey et al. (2016) reported similar results, albeit with some overfitting issues and a limited testing scenario.
With this literature review in mind, our application draws on the prior research to build a more robust hybrid forecasting model. Our modification to existing hybrid techniques is that the application produces forecasts from 5 different models and then presents the mean of the 3 best performing ones. This means that not all models make the cut in the final forecast presented to users.
First, we run the line of code below to clear the environment and remove any existing R objects.
rm(list=ls())
The next code chunk checks whether the required packages are installed, installs any that are missing, and then loads each library into the current working environment, ready for use.
packages <- c('sf', 'tmap', 'tidyverse', 'forecast',
              'tseries', 'readxl', 'tidyquant',
              'dygraphs', 'TSstudio', 'plotly',
              'tsibble', 'ggplot2', 'tidymodels',
              'modeltime', 'modeltime.ensemble',
              'timetk', 'glmnet', 'randomForest')

for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
The application will use open source and live data obtained through the tidyquant package.
In the code chunk below, we define the key parameters for the time series dataset ticker_data_daily. Once again, in the final application these parameters will be user-selected:
from_date - the beginning date of the time series data
to_date - the ending date of the time series data
ticker_selected - the ticker of the stock we selected (AAPL)
from_date <- "2018-01-01"
to_date <- "2020-12-31"
ticker_selected <- "AAPL"

ticker_data_daily <- tq_get(ticker_selected,
                            get = "stock.prices",
                            from = from_date,
                            to = to_date)
Next, we call the glimpse function to take a quick look at the table, its data structures, data types and formats, to ensure everything has been imported as it should be. In particular, the date column in a time series dataset must be correctly identified as a date in order to proceed with analysis and forecasting. Since the data has been imported correctly, as seen below, no further transformations are needed and we can proceed to the next step.
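The glimpse() call that produces the output shown below:

```r
glimpse(ticker_data_daily)
```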
## Rows: 755
## Columns: 8
## $ symbol <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL~
## $ date <date> 2018-01-02, 2018-01-03, 2018-01-04, 2018-01-05, 2018-01-08, ~
## $ open <dbl> 42.5400, 43.1325, 43.1350, 43.3600, 43.5875, 43.6375, 43.2900~
## $ high <dbl> 43.0750, 43.6375, 43.3675, 43.8425, 43.9025, 43.7650, 43.5750~
## $ low <dbl> 42.3150, 42.9900, 43.0200, 43.2625, 43.4825, 43.3525, 43.2500~
## $ close <dbl> 43.0650, 43.0575, 43.2575, 43.7500, 43.5875, 43.5825, 43.5725~
## $ volume <dbl> 102223600, 118071600, 89738400, 94640000, 82271200, 86336000,~
## $ adjusted <dbl> 41.38024, 41.37303, 41.56522, 42.03845, 41.88231, 41.87751, 4~
We also drop the unneeded columns as we will be looking specifically at the adjusted close prices for this exercise.
Why do we use adjusted close prices?: The adjusted closing price amends a stock’s closing price to accurately reflect that stock’s value after taking corporate actions (e.g. stock splits, dividends) into account. These corporate actions can affect the closing prices. It is often used when examining historical returns or doing a detailed analysis of past performance.
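A small illustration with made-up numbers: around a 4-for-1 split, raw closing prices show an artificial 75% drop even though nothing fundamental changed, while adjusted prices (historical prices divided by the subsequent split factor) remain a continuous series.

```r
raw_close    <- c(500, 498, 125, 126)  # raw closes around a 4-for-1 split
split_factor <- c(4, 4, 1, 1)          # splits occurring after each day
adj_close    <- raw_close / split_factor
adj_close    # 125.0 124.5 125.0 126.0 -- no artificial 75% drop
```

Real adjusted prices (such as the `adjusted` column from tidyquant) also account for dividends, but the principle is the same.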
# Keep only symbol, date and adjusted close (drop open, high, low, close, volume)
ticker_daily_adjclose <- subset(ticker_data_daily,
                                select = c(symbol, date, adjusted))
We can now plot a time series chart for a visual analysis of the data. The chart below shows the trend of the stock price for the years 2018-2020. The slider below it also allows users to zoom in on a date range to take a closer look at the data on a day-to-day basis. In the Shiny application, where users can select the date range of the data to import, the slider will come in handy when the selected range is large.
ticker_daily_adjclose %>%
  plot_time_series(date, adjusted,
                   .smooth_size = 0.2,
                   .title = "AAPL Adjusted Closing Prices 2018-2020",
                   .interactive = TRUE,
                   .plotly_slider = TRUE)
Time Series Plot for AAPL Stock Adjusted Closing Prices.
Before we go into time series forecasting proper, we need to further break down the data into a training dataset used to build the models and a testing dataset used to determine their efficacy. For the purposes of this paper, we use 3 months' worth of data for the testing dataset. The user can change this split, but doing so will affect the model outputs in terms of accuracy and the amount of data available for training.
The following code chunk is then used to split the data into training and testing sets.
splits <- time_series_split(ticker_daily_adjclose, assess = "3 months", cumulative = TRUE)
## Using date_var: date
splits %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(date, adjusted,
                           .title = "Time Series Cross Validation Plan",
                           .interactive = TRUE)
Splitting the time series dataset
As some of the models that we are using are machine learning models, we create a feature engineering recipe that is applied to the data to generate features that machine learning models can utilise.
recipe_spec <- recipe(adjusted ~ date, training(splits)) %>%
  step_timeseries_signature(date) %>%
  step_rm(matches("(.iso$)|(.xts$)")) %>%
  step_normalize(matches("(index.num$)|(_year$)")) %>%
  step_dummy(all_nominal()) %>%
  step_fourier(date, K = 1, period = 12)

recipe_spec %>% prep() %>% juice()
## # A tibble: 694 x 42
## date adjusted date_index.num date_year date_half date_quarter
## <date> <dbl> <dbl> <dbl> <int> <int>
## 1 2018-01-02 41.4 -1.73 -1.15 1 1
## 2 2018-01-03 41.4 -1.72 -1.15 1 1
## 3 2018-01-04 41.6 -1.72 -1.15 1 1
## 4 2018-01-05 42.0 -1.72 -1.15 1 1
## 5 2018-01-08 41.9 -1.71 -1.15 1 1
## 6 2018-01-09 41.9 -1.70 -1.15 1 1
## 7 2018-01-10 41.9 -1.70 -1.15 1 1
## 8 2018-01-11 42.1 -1.70 -1.15 1 1
## 9 2018-01-12 42.5 -1.69 -1.15 1 1
## 10 2018-01-16 42.3 -1.68 -1.15 1 1
## # ... with 684 more rows, and 36 more variables: date_month <int>,
## # date_day <int>, date_hour <int>, date_minute <int>, date_second <int>,
## # date_hour12 <int>, date_am.pm <int>, date_wday <int>, date_mday <int>,
## # date_qday <int>, date_yday <int>, date_mweek <int>, date_week <int>,
## # date_week2 <int>, date_week3 <int>, date_week4 <int>, date_mday7 <int>,
## # date_month.lbl_01 <dbl>, date_month.lbl_02 <dbl>, date_month.lbl_03 <dbl>,
## # date_month.lbl_04 <dbl>, date_month.lbl_05 <dbl>, date_month.lbl_06 <dbl>,
## # date_month.lbl_07 <dbl>, date_month.lbl_08 <dbl>, date_month.lbl_09 <dbl>,
## # date_month.lbl_10 <dbl>, date_month.lbl_11 <dbl>, date_wday.lbl_1 <dbl>,
## # date_wday.lbl_2 <dbl>, date_wday.lbl_3 <dbl>, date_wday.lbl_4 <dbl>,
## # date_wday.lbl_5 <dbl>, date_wday.lbl_6 <dbl>, date_sin12_K1 <dbl>,
## # date_cos12_K1 <dbl>
We will then create the forecasting models that the application uses to forecast forward prices; the code chunks follow below. Further reading on each model can be found in the references at the end of this report.
The models that we have included so far are: ARIMA, Prophet, Elastic Net (GLMNet), Random Forest, and Boosted Prophet (Prophet with XGBoost errors).
# 1 ARIMA
model_spec_arima <- arima_reg() %>%
  set_engine("auto_arima")

wflw_fit_arima <- workflow() %>%
  add_model(model_spec_arima) %>%
  add_recipe(recipe_spec %>% step_rm(all_predictors(), -date)) %>%
  fit(training(splits))
## frequency = 5 observations per 1 week
# 2 Prophet
model_spec_prophet <- prophet_reg() %>%
  set_engine("prophet")

wflw_fit_prophet <- workflow() %>%
  add_model(model_spec_prophet) %>%
  add_recipe(recipe_spec %>% step_rm(all_predictors(), -date)) %>%
  fit(training(splits))
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
# 3 Elastic Net
model_spec_glmnet <- linear_reg(
  mixture = 0.9,
  penalty = 4.36e-6
) %>%
  set_engine("glmnet")

wflw_fit_glmnet <- workflow() %>%
  add_model(model_spec_glmnet) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(splits))
# 4 Random Forest
model_spec_rf <- rand_forest(trees = 500, min_n = 50) %>%
  set_engine("randomForest")

wflw_fit_rf <- workflow() %>%
  add_model(model_spec_rf) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(splits))
# 5 Boosted Prophet
model_spec_prophet_boost <- prophet_boost() %>%
  set_engine("prophet_xgboost", daily.seasonality = TRUE)

wflw_fit_prophet_boost <- workflow() %>%
  add_model(model_spec_prophet_boost) %>%
  add_recipe(recipe_spec) %>%
  fit(training(splits))
## Warning: The following arguments cannot be manually modified and were removed:
## daily.seasonality.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
Next, we create a modeltime_table using the modeltime package. This allows for easier calibration and refitting in the later stages of the forecasting process.
ticker_models <- modeltime_table(
  wflw_fit_arima,
  wflw_fit_prophet,
  wflw_fit_glmnet,
  wflw_fit_rf,
  wflw_fit_prophet_boost
)
ticker_models
## # Modeltime Table
## # A tibble: 5 x 3
## .model_id .model .model_desc
## <int> <list> <chr>
## 1 1 <workflow> ARIMA(0,1,1)(0,0,1)[5] WITH DRIFT
## 2 2 <workflow> PROPHET
## 3 3 <workflow> GLMNET
## 4 4 <workflow> RANDOMFOREST
## 5 5 <workflow> PROPHET W/ XGBOOST ERRORS
Next, we calibrate the models using the test data obtained by splitting the full time series dataset into training and testing sets. Using the code chunk below, we also plot the forecasts of all 5 models against the actual test data, which gives a rough idea of each model's accuracy.
calibration_table <- ticker_models %>%
  modeltime_calibrate(testing(splits))

calibration_table %>%
  modeltime_forecast(actual_data = ticker_daily_adjclose) %>%
  plot_modeltime_forecast(.interactive = TRUE,
                          .plotly_slider = TRUE)
## Using '.calibration_data' to forecast.
Of course, not all models work out. From the previous chart, we can tell that the Random Forest and GLMNet models produce results far off the test dataset, which means we should consider dropping them when building the final application. To decide quantitatively which models to drop, we build a modeltime_accuracy table. Using MAPE as the measure, we select only the top 3 models for the ensemble model, which in this scenario means dropping the Random Forest and GLMNet models. Depending on their preferences, users will be able to select which accuracy measure is used to filter out inaccurate models.
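For reference, MAPE (mean absolute percentage error) can be computed by hand as the average absolute error expressed as a percentage of the actual value; the numbers below are illustrative only.

```r
# MAPE: mean of |actual - forecast| / actual, expressed as a percentage
mape <- function(actual, forecast) {
  mean(abs((actual - forecast) / actual)) * 100
}

mape(c(100, 102, 104), c(101, 101, 106))  # ~1.30 (percent)
```

Lower MAPE means a more accurate model, which is why the 3 models with the lowest MAPE are retained.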
When presenting the time series forecasting results, we show only a single aggregated set of forecasts from the ensemble model, to avoid confusing the user. In our application, the ensemble model takes the mean of the forecasts from the 3 models with the best (lowest) MAPE scores. This mean is then shown in the chart as the expected future share price that users will rely on for decision making. The ensemble approach combines the strengths, and mitigates the weaknesses, of the individual models by averaging the 3 that remain.
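The report does not show the code for this step, so the following is a sketch of how the top-3 selection and mean ensemble might be implemented, assuming the modeltime.ensemble API (ensemble_average) loaded earlier; object names are our own.

```r
# Rank models by MAPE on the test set and keep the 3 most accurate
top3_ids <- calibration_table %>%
  modeltime_accuracy() %>%
  arrange(mape) %>%        # lowest MAPE first
  slice(1:3) %>%
  pull(.model_id)

# Average the forecasts of the surviving models
ensemble_fit_mean <- calibration_table %>%
  filter(.model_id %in% top3_ids) %>%
  ensemble_average(type = "mean")

# Wrap the ensemble in a modeltime table so it can be calibrated and refit
ensemble_calibrated <- modeltime_table(ensemble_fit_mean) %>%
  modeltime_calibrate(testing(splits))
```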
## -- Modeltime Ensemble -------------------------------------------
## Ensemble of 3 Models (MEAN)
##
## # Modeltime Table
## # A tibble: 3 x 3
## .model_id .model .model_desc
## <int> <list> <chr>
## 1 1 <workflow> ARIMA(0,1,1)(0,0,1)[5] WITH DRIFT
## 2 2 <workflow> PROPHET
## 3 5 <workflow> PROPHET W/ XGBOOST ERRORS
Next we plot the ensemble model results against the test data. From the plot, we can see that the trend of the forecast is largely similar to that of the test data.
Finally, we refit the ensemble model against the full dataset in order to forecast forward; this is done using the modeltime_refit function. We can also see the confidence intervals of the future prices: a larger confidence interval indicates a more volatile stock price. This is the final visual that users will interact with in the Shiny application; the steps taken to get to this point happen behind the scenes and will not be shown to the user.
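A sketch of this final refit-and-forecast step, assuming the calibrated mean ensemble is stored in an object here called ensemble_calibrated (a hypothetical name, since the report does not show this assignment):

```r
# modeltime_refit() re-trains the ensemble on the full dataset,
# then modeltime_forecast() projects forward with confidence intervals
ensemble_calibrated %>%
  modeltime_refit(ticker_daily_adjclose) %>%
  modeltime_forecast(h = "3 months",
                     actual_data = ticker_daily_adjclose) %>%
  plot_modeltime_forecast(.interactive = TRUE,
                          .plotly_slider = TRUE)
```

The forecast horizon `h` is shown as "3 months" for illustration; in the application it will be user-selected (between 1 and 12 months).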
## frequency = 5 observations per 1 week
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
## Warning: The following arguments cannot be manually modified and were removed:
## daily.seasonality.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
When we first started building and conceptualising the Shiny application, one of the key ideas for interactivity was to let users choose their preferred model for forecasting. This would not have worked well, because some models are clearly more accurate than others.
We therefore removed that option and put in an ensemble model instead. From the screenshots below, we can see that the ensemble model works well to narrow the confidence interval, giving the user more certainty about price movements. The results of the ensemble model also trend very closely with the test data, which gives us confidence that the approach is working well.
Comparison of having individual models VS an ensemble model
Including a table of model accuracy across all the different accuracy measures will also help the user choose a better forecasting model. While the current report uses MAPE as its accuracy measure, other investors might prefer different measures, which can result in different forecasts. A short explanation will be included to describe each accuracy measure and how they differ from one another.
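Such a table can be produced directly from the calibration table built earlier, using modeltime's accuracy helpers:

```r
# One row per model; columns include MAE, MAPE, MASE, SMAPE, RMSE and RSQ
calibration_table %>%
  modeltime_accuracy() %>%
  table_modeltime_accuracy(.interactive = TRUE)
```

The interactive table lets the user sort by whichever measure they prefer.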
The number of models included in the ensemble model will remain at 3 in order to prevent the forecast from being distorted by unreliable models.
The user inputs for this sub-module are:
Ticker of the stock to be forecast
Time period of the time series data ingested for forecasting
Accuracy measure for the forecasted models (e.g. MAE/MAPE/MASE/SMAPE/RMSE/RSQ)
Forecasting period (between 1 month and 12 months)
The proposed design is as follows.
Handdrawn sketch of the Shiny Application sub-module.
https://blogs.oracle.com/ai-and-datascience/post/performing-a-time-series-analysis-on-the-sampp-500-stock-index#:~:text=Time%2Dseries%20analysis%20is%20a,to%20predict%20future%20stock%20values
https://machinelearningmastery.com/time-series-forecasting-performance-measures-with-python/
https://yiqiaoyin.files.wordpress.com/2017/05/time-series-analysis-on-stock-returns.pdf
https://core.ac.uk/download/pdf/43552332.pdf
https://www.sciencedirect.com/science/article/abs/pii/S0305048311001435?via%3Dihub